DETAILED ACTION

Response to Amendment 

1. Examiner enters applicant's arguments, filed 12/16/2005, regarding the Office Action
of 11/03/2005; the amendment of the specification, page 4 lines 13-15; amended claims 1, 8-9, 15, and
18-19; and cancelled claim 3.

Continued Examination Under 37 CFR 1.114

2. A request for continued examination under 37 CFR 1.114, including the fee set forth in 
37 CFR 1.17(e), was filed in this application after final rejection. Since this application is
eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e)
has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 
37 CFR 1.114. Applicant's submission filed on 3/15/2006 has been entered. 

Response to Arguments 

3. Applicant's arguments filed 03/15/2006 have been fully considered but they are not 
persuasive. 

Applicant argues, with regard to the claims under 35 USC 102(a), that Basu et al. (referred to as
Basu) (6,219,640) is limited to using the N best entities after noise determination processing.
Examiner respectfully disagrees. Basu discloses combined audio-visual feature vectors (col. 12
lines 50-54); thus the correlation values are determined, and the sum of the elements of subsets
between audio and object features is in the extracted visual speech feature vectors (V) from
extractor 22 and the acoustic feature vectors (A) from extractor 14. The AV utterance verifier 28
performs verification, involving comparisons of the resulting likelihood of aligning the audio on
a random sequence of visemes, which are visual phonemes, generally mouth shapes that
accompany speech utterances and are categorized and pre-stored similarly to acoustic phonemes.
Utterance verification determines whether the speech used to verify the speaker in audio path I and
the visual cues used to verify the speaker in video path II correlate or align (col. 11 lines 10-31
and col. 7 lines 6-26).

In response to applicant's argument that there is no suggestion to combine the references, 
the examiner recognizes that obviousness can only be established by combining or modifying the 
teachings of the prior art to produce the claimed invention where there is some teaching, 
suggestion, or motivation to do so found either in the references themselves or in the knowledge 
generally available to one of ordinary skill in the art. See In re Fine, 837 F.2d 1071, 5
USPQ2d 1596 (Fed. Cir. 1988) and In re Jones, 958 F.2d 347, 21 USPQ2d 1941 (Fed. Cir. 1992).
In this case, Nevenka et al. teach audio features that provide a system that passively records and
identifies various events that occur in the home or office and can index the events using
information. This way, a user can easily retrieve individual events and sub events using plain
language commands, or the processing system can determine whether an action is necessary in
response to the identified event (page 2 paragraph 27).

Claim Rejections - 35 USC § 103 
4. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in 
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are 
such that the subject matter as a whole would have been obvious at the time the invention was made to a person 
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the 
manner in which the invention was made. 



Application/Control Number: 10/076,194
Art Unit: 2626

5. Claims 1, 2, 4, 5, 8, 11, and 16-17 are rejected under 35 U.S.C. 103(a) as being
unpatentable over Basu et al. (6,219,640) in view of Nevenka (2003/0108334).

As per claim 1, Basu et al. teach an audio-visual processing data (col. 13, lines 55-58)
comprising:

an object detection module capable of providing a plurality of object features from the 
video data (Fig. 4 elements 10, 20, and 24; and col. 6 lines 48-51; col. 10 lines 12-25); 

an audio processor module capable of providing a plurality of audio features from the 
video data (col. 3 lines 53-59; col. 4 lines 58-67 and col. 8 lines 42-46); 

a processor coupled to the object detection and the audio segmentation modules (col. 13
lines 31-41; col. 11 lines 10-14; and col. 6 lines 33-48), arranged to determine a maximum
correlation value among a plurality of correlation values between the plurality of object features 
and plurality of audio features (a level of correlation between the signals, col. 2, lines 35-36) 
wherein said correlation values are determined as the sum of the elements in a subset of said 
audio features (col. 9 lines 40-49, col. 10 lines 1-11; and col. 9 lines 15-34; maximum score is 
calculated from terms of the inner sum approach, selects the highest and second highest score, 
from the top scores of the face identification process, the identification, which includes audio and 
video, of the speaker is known); 

Basu et al. do not explicitly teach the audio features consisting of: two or more of the 
following: average energy, pitch, zero crossing, bandwidth, band central, roll off, low ratio, 
spectral flux, or 12 MFCC components. 

However, Nevenka et al. teach feature extraction from a list consisting of energy, pitch,
and bandwidth (Fig. 2 "processor extracts feature audio streams"; page 6 paragraph 65
lines 9-11 "audio parameters... energy, pitch, and bandwidth").

Therefore, it would have been obvious for one of ordinary skill at the time 
of invention to combine Basu et al.'s audio and visual speaker recognition into the adaptive 
environment system of Nevenka et al., because Nevenka et al. teach that this would provide a
system that passively records and identifies various events that occur in the home or office and
can index the events using information; this way, a user can easily retrieve individual events and
sub events using plain language commands, or the processing system can determine whether an
action is necessary in response to the identified event (page 2 paragraph 27 lines 17-24).

As per claim 2, which depends on claim 1, Basu et al. teach a processor arranged to 
determine whether an animated object in the video data is associated with audio (determine the 
level of correlation between the signals, col. 2, lines 35-36). 

As per claim 4, which depends on claim 2, Basu et al. teach that the animated object is a
face (locate and track a face, other facial features, col. 4, lines 12-13) and where the processor is
arranged to determine whether the face is speaking (phonetic/visemic information from
the geometry of the lip contour and its time dynamics, col. 10, lines 53-55).

As per claim 5, which depends on claim 4, Basu et al. teach wherein the plurality of
object features are eigenfaces that represent global features of the face ("Distance from Face
Space" (DFFS), col. 7, lines 32-35; feature vectors, col. 8, lines 7-8).

As per claims 8, 15 and 16, Basu et al. teach identifying a speaking person (speaker 
recognition and utterance verification, title) within video data, the method comprising: 

- receiving video data including image (fig 1, element 4) and audio (figure 1, 
element 6) information, 

- determining a plurality of face image features from one or more faces in the video data 
(sub-features, hairline, chin, mouth, eyes, eyebrows, col. 7, lines 55-57), determining a plurality of
audio features related to audio information (extracts spectral features, col. 4, lines 61-63), 

calculating correlation values between the plurality of face image features and the audio 
features (a level of correlation between the signals, lines 34-35), and 

determining the speaking person based on a maximum of the correlation values (highest
score identified as the speaker, col. 10, lines 10-11; col. 9 lines 40-49, col. 10 lines 1-5, and col. 7
lines 6-25),

wherein said correlation values are determined as the sum of the elements of a subset
between said audio features and selected object features (col. 9 lines 40-49, col. 10 lines 1-5, col.
7 lines 6-25, col. 11 lines 10-31, col. 8 lines 5-20, and col. 9 lines 15-34).
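
As an illustrative sketch only (not taken from Basu et al. or from the claim language), the step of determining the speaking person from the maximum of several correlation values can be pictured as follows. The function names, the sample values, and the reading of "sum of the elements of a subset" as a sum of element-wise products over chosen feature indices are all assumptions made for illustration.

```python
def correlation_score(audio_features, face_features, subset):
    """Hypothetical score: sum of element-wise products over the chosen indices."""
    return sum(audio_features[i] * face_features[i] for i in subset)


def pick_speaker(audio_features, faces, subset):
    """Return the face label with the largest score, plus all scores.

    `faces` maps a label to a feature list of the same length as
    `audio_features`; `subset` lists the feature indices to include.
    """
    scores = {
        label: correlation_score(audio_features, features, subset)
        for label, features in faces.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Under these assumptions, calling `pick_speaker` with one audio feature list and two candidate faces returns the label of whichever face has the larger summed score over the selected indices.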

As per claims 11 and 17, which depend on claims 8 and 16, Basu et al. teach a
determining step where it includes determining the speaking person based upon the one
or more faces that has the largest correlation (highest combined score is identified as the speaker,
col. 10, lines 10-11).

6. Claims 6-7 are rejected under 35 USC 103(a) as being unpatentable over Basu et al. 
(6,219,640) in view of Nevenka (2003/0108334), in further view of Bradford et al.
(2002/0103799).

As per claim 6, which depends on claim 1, neither Basu et al. nor Nevenka et al.
explicitly teach a latent semantic indexing (LSI) module (coupled to the processor) that 
preprocesses the plurality of object features and the plurality of audio features before the 
correlation is performed. 

However, Bradford teaches that latent semantic indexing can be used to process both
audio and text information vectors (para. 0079, lines 8-10).

Therefore, it would have been obvious for one of ordinary skill at the time of invention to 
combine the Audio-Visual speaker recognition into the adaptive environment of Basu et al. in 
view of Nevenka, into the audio and visual comparison of Bradford, because an artisan of 
ordinary skill in the art would want to provide a meaningful description of equivalents (Bradford
para. 0079).

As per claim 7, which depends on claim 6, neither Basu et al. nor Nevenka et al.
explicitly teach a latent semantic indexing module including a singular value decomposition
(SVD) module.

However, Bradford teaches using an SVD module (figure 2, para. 0029) to reduce the term x
Doc matrix to a product of three matrices.
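
For orientation, the general textbook form of a singular value decomposition is shown below; this is the standard definition, not a formula quoted from Bradford. The symbols X, U, Σ, V, m, n, and r are introduced here only for illustration: an m x n matrix X of rank r is factored into three matrices, with the singular values ordered from largest to smallest.

```latex
X = U \Sigma V^{T}, \qquad
X \in \mathbb{R}^{m \times n},\quad
U \in \mathbb{R}^{m \times r},\quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\quad
V \in \mathbb{R}^{n \times r},\quad
\sigma_1 \ge \dots \ge \sigma_r > 0
```

Keeping only the k largest singular values (k < r) gives the smaller-matrix product X_k = U_k Σ_k V_k^T, which is the best rank-k approximation of X in the least-squares (Frobenius norm) sense.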

Therefore, it would have been obvious for one of ordinary skill at the time of invention to 
combine the Audio-Visual speaker recognition into the adaptive environment of Basu et al. in 
view of Nevenka, into the audio and visual comparison of Bradford, because an artisan of 
ordinary skill in the art would want to reduce the matrix to a product of three matrices
(Bradford para. 0029).

7. Claims 9-10, 12-14, and 18-20 are rejected under 35 USC 103(a) as being
unpatentable over Basu et al. (6,219,640) in view of Nevenka et al. (2003/0108334), in further 
view of Wang et al. (Multimedia Content Analysis). 

As per claim 9, which depends on claim 8, neither Basu et al. nor Nevenka et al.
explicitly teach normalizing the vectors containing the video/audio features.

However, Wang et al. teach normalizing these vectors (normalized correlation matrix, pg.
20, line 2).
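
As a minimal sketch of what a normalized correlation between two feature sequences can look like, the example below computes the standard Pearson correlation coefficient. It is offered under the assumption that "normalized" means mean-removed and scaled to the range [-1, 1]; it is not presented as the computation used by Wang et al., and the function name is hypothetical.

```python
import math


def normalized_correlation(x, y):
    """Pearson correlation of two equal-length sequences, in [-1, 1].

    Returns 0.0 when either sequence has no variation, since the
    coefficient is undefined in that case.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]  # mean-removed audio-side values
    dy = [b - mean_y for b in y]  # mean-removed video-side values
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom
```

Because the result does not depend on the scale or offset of either input, features measured in different units (for example an audio energy value and a motion value) can be compared on a common footing.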

Therefore, it would have been further obvious to one having ordinary skill in the art at the 
time of invention to combine the Audio-Visual speaker recognition into the adaptive
environment of Basu et al. in view of Nevenka, into the Multimedia Content Analysis of Wang et
al., because an artisan of ordinary skill in the art would want to better interpret the correlation, if
any, that exists between the feature vectors, to see if they provide independent information
(Wang et al., p. 19, col. 2, lines 1-5).

As per claims 10 and 18, which depend on claims 8 and 15, neither Basu et al. nor
Nevenka et al. explicitly teach performing a singular value decomposition on the normalized face
image features and audio features.

However, Wang et al. do teach SVD on a normalized correlation matrix (pg. 20, col. 1,
line 1 and col. 2, lines 4-5; KLT, Karhunen-Loeve transform).

Therefore, it would have been obvious for one of ordinary skill at the time of invention to 
combine the Audio-Visual speaker recognition into the adaptive environment of Basu et al. in 
view of Nevenka, into the Multimedia Content Analysis of Wang et al, because an artisan of 
ordinary skill in the art would want to decorrelate the features with KLT (Wang et al. page 20
col. 2 para. 1).

As per claim 12, neither Basu et al. nor Nevenka et al. explicitly teach a calculating
step which includes forming a matrix of the face image features and the audio features. 

However, Wang et al. do teach combining the two in a single matrix (14 audio features, 
last six motion features, figure 9 and pg 20, Lines 8-10). 

It would have been further obvious to one having skill in the art at the time of invention 
to combine the Audio-Visual speaker recognition into the adaptive environment of Basu et al. in
view of Nevenka, into the Multimedia Content Analysis of Wang et al, because an artisan of 
ordinary skill in the art would want the dependence among features within the same and across 
different modalities to be computed (Wang et al. pg 19, lines 5-8). 



As per claims 13 and 19, which depend on claims 12 and 18, neither Basu et al.
nor Nevenka et al. explicitly teach performing an optimal approximate fit using smaller matrices
as compared to full rank matrices formed by the face image features and audio features.

However, Wang et al. do teach using SVD to allow for dimensionality reduction (pg 10, 
Lines 18-19). 

It would have been further obvious to one having skill in the art at the time of invention 
to combine the Audio-Visual speaker recognition into the adaptive environment of Basu et al. in
view of Nevenka, into the Multimedia Content Analysis of Wang et al., because an artisan of
ordinary skill in the art would want to decorrelate the features (Wang et al. page 20 col. 2 para.
1).

As per claims 14 and 20, which depend on claims 13 and 19, Basu et al. teach
choosing the rank of the smaller matrices to remove noise and unrelated information from
the full rank matrices (col. 14 lines 31-59).

Conclusion 

Any inquiry concerning this communication or earlier communications from 
the examiner should be directed to Myriam Pierre whose telephone number is 571-272-7611.
The examiner can normally be reached on Monday - Friday from 5:30 a.m. - 2:00 p.m.

8. If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Richemond Dorvil, can be reached on (571) 272-7602. The fax phone number for the
organization where this application or proceeding is assigned is 571-273-8300. 

9. Information as to the status of an application may be obtained from the Patent 
Application Information Retrieval (PAIR) system. Status information for published applications
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 
applications is available through Private PAIR only. For more information about the PAIR 
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR 
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 
05/18/2006 MP 




